Abstract

I describe an exploration criterion that attempts to minimize the error of a learner by minimizing its 
estimated squared bias. I describe experiments with locally-weighted regression on two simple kinematics 
problems, and observe that this “bias-only” approach outperforms the more common “variance-only” 
exploration approach, even in the presence of noise. 
1 Introduction

In recent years, there has been an explosion of interest in
“active” machine learning systems. These are learning
systems that make queries, or perform experiments to 
gather data that are expected to maximize performance. 
When compared with “passive” learning systems, which 
accept given, or randomly drawn data, active learners 
have demonstrated significant decreases in the amount 
of data required to achieve equivalent performance. In 
industrial applications, where each experiment may take 
days to perform and cost thousands of dollars, a method 
for optimally selecting these points would offer enormous 
savings in time and money. 

An active learning system will typically attempt to 
select data that will minimize its predictive error. The 
error of a learner can be decomposed into bias and variance
terms. Most research in selecting optimal actions
or queries has assumed that the learner is approximately
unbiased, and that to minimize learner error, variance is
the only thing to minimize (a few examples include Fedorov
[1972], MacKay [1992], Cohn [1994; 1995], Paass
[1995]). In practice, however, there are very few problems
for which we have unbiased learners. Frequently,
bias constitutes a large portion of a learner’s error; if
the learner is deterministic and the data are noise-free,
then bias is the only source of error.¹

In this paper I describe an algorithm which selects 
actions/queries designed to minimize the bias of a locally
weighted regression-based learner. Empirically,
“variance-minimizing” strategies which ignore bias seem 
to perform well, even in cases where, strictly speaking, 
there is no variance to minimize. In the tasks considered 
in this paper, the bias-minimizing strategy consistently 
outperforms variance minimization, even in the presence 
of noise. 

1.1 Bias and variance 

Let us begin by defining $P(x, y)$ to be the unknown joint
distribution over $x$ and $y$, and $P(x)$ to be the known
marginal distribution of $x$ (commonly called the input
distribution). We denote the learner’s output on input
$x$, given training set $\mathcal{D}$, as $\hat{y}(x; \mathcal{D})$.
We can then write the expected error of the learner as

\[
\int_X E\left[(\hat{y}(x;\mathcal{D}) - y(x))^2 \mid x\right] P(x)\,dx, \tag{1}
\]

where $E[\cdot]$ denotes the expectation over $P$ and over training
sets $\mathcal{D}$. The expectation inside the integral may be
decomposed as follows (Geman et al., 1992):

\begin{align*}
E\left[(\hat{y}(x;\mathcal{D}) - y(x))^2 \mid x\right] ={}& \tag{2} \\
& E\left[(y(x) - E[y \mid x])^2\right] \tag{3} \\
& + \left(E_{\mathcal{D}}[\hat{y}(x;\mathcal{D})] - E[y \mid x]\right)^2 \\
& + E_{\mathcal{D}}\left[(\hat{y}(x;\mathcal{D}) - E_{\mathcal{D}}[\hat{y}(x;\mathcal{D})])^2\right]
\end{align*}

where $E_{\mathcal{D}}[\cdot]$ denotes the expectation over training sets.
The first term in Equation 2 is the variance of $y$ given $x$:
it is the noise in the distribution, and does not depend
on our learner or how the training data are chosen. The
second term is the learner’s squared bias, and the third is
its variance; these last two terms comprise the expected
squared error of the learner with respect to the regression
function $E[y \mid x]$.

1 The bias term here is a statistical bias, which is distinct
from the inductive bias discussed in some machine learning
research. See Dietterich and Kong [1995] for a discussion of
the relationship between the two.
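As an illustration of the decomposition in Equation 2, the squared bias and variance of a learner at a single input can be estimated by Monte Carlo over training sets. The sketch below is not from the paper: the function names, the uniform input distribution on [0, 1], the quadratic target and the global linear learner are all assumptions chosen only to make the example small.

```python
import numpy as np

def decompose_error(fit_predict, target, x_query, n_train=20,
                    noise_sd=0.1, n_sets=500, seed=0):
    """Monte Carlo estimate of squared bias and variance at x_query.

    fit_predict(x_train, y_train, x_query) -> prediction
    (a hypothetical signature used only for this sketch).
    """
    rng = np.random.default_rng(seed)
    preds = np.empty(n_sets)
    for s in range(n_sets):
        # Draw a fresh training set D (inputs assumed uniform on [0, 1]).
        x_train = rng.uniform(0.0, 1.0, n_train)
        y_train = target(x_train) + rng.normal(0.0, noise_sd, n_train)
        preds[s] = fit_predict(x_train, y_train, x_query)
    truth = target(x_query)                    # E[y | x] for this target
    sq_bias = (preds.mean() - truth) ** 2      # second term of Equation 2
    variance = preds.var()                     # third term of Equation 2
    mse = np.mean((preds - truth) ** 2)        # error w.r.t. E[y | x]
    return sq_bias, variance, mse

def linear_fit_predict(x_train, y_train, x_query):
    """Global least-squares line, a deliberately biased learner for x**2."""
    slope, intercept = np.polyfit(x_train, y_train, 1)
    return slope * x_query + intercept

sq_bias, variance, mse = decompose_error(
    linear_fit_predict, lambda x: x ** 2, x_query=0.5)
print(sq_bias, variance, mse)
```

Over the sampled training sets, the last value equals the sum of the first two, which is the statement that squared bias and variance together make up the learner's expected squared error with respect to the regression function.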

Most research in active learning assumes that the second
term of Equation 2 is approximately zero, that is,
that the learner is unbiased. If this is the case, then
one may concentrate on selecting data so as to minimize
the variance of the learner. Although this “all-variance”
approach is optimal when the learner is unbiased, truly
unbiased learners are rare. Even when the learner’s
representation class is able to match the target function
exactly, bias is generally introduced by the learning
algorithm and learning parameters. From the Bayesian
perspective, a learner is only unbiased if its priors are 
exactly correct. 

The optimal choice of query would, of course, minimize
both bias and variance, but I leave that for future
work. For the purposes of this paper, I will only be
concerned with selecting queries that are expected to
minimize learner bias. This approach is justified in cases
where noise is believed to be only a small component 
of the learner’s error. If the learner is deterministic and 
there is no noise, then strictly speaking, there is no error 
due to variance — all the error must be due to learner 
bias. In cases with non-determinism or noise, all-bias 
minimization, like all-variance minimization, becomes an 
approximation of the optimal approach. 

The learning model discussed in this paper is a form 
of locally weighted regression (LWR) [Cleveland et al.,
1988], which has been used in difficult machine learning
tasks, notably the “robot juggler” of Schaal and Atkeson
[1994]. Previous work [Cohn et al., 1995] discussed
all-variance query selection for LWR; in the remainder of
this paper, I describe a method for performing all-bias
query selection. Section 2 describes the criterion that
must be optimized for all-bias query selection. Section 3
describes the locally weighted regression learner used in
this paper and describes how the all-bias criterion may
be computed for it. Section 4 describes the results of
experiments using this criterion on several simple domains.
Directions for future work are discussed in Section 5. 

2 All-bias query selection 

Let us assume for the moment that we have a source of
noise-free examples $(x_i, y_i)$ and a deterministic learner
which, given input $x$, outputs estimate $\hat{y}(x)$.² Let us
also assume that we have an accurate estimate of the
bias of $\hat{y}$ which can be used to estimate the true
function $y(x) = \hat{y}(x) - \text{bias}(x)$. We will break these rather
strong assumptions of noise-free examples and accurate
bias estimates in Section 4, but they are useful for
deriving the theoretical approach described below.

2 For clarity, I will drop the argument $x$ except where
required for disambiguation. I will also denote only the
univariate case; the results apply in higher dimensions as well.



Given the accurate bias estimate, our task is then to
force the biased estimator into the best approximation of
$y(x)$ with the fewest number of examples. This, in effect,
transforms the query selection problem into an example
filter problem similar to that studied by Plutowski and
White [1993] for neural networks. Below, I derive this
criterion for estimating the change in error at $x$ given a
new queried example at $\tilde{x}$.

Since we have (temporarily) assumed a deterministic
learner and noise-free data, the expected error in
Equation 2 simplifies to:

\[
E\left[(\hat{y}(x;\mathcal{D}) - y(x))^2 \mid x, \mathcal{D}\right] = (\hat{y}(x;\mathcal{D}) - y(x))^2. \tag{4}
\]

We want to select a new $\tilde{x}$ such that when we add
$(\tilde{x}, \tilde{y})$ to the training set $\mathcal{D}$, the resulting
squared bias is minimized:

\[
(\hat{y}' - y)^2 = \left(\hat{y}(x; \mathcal{D} \cup (\tilde{x}, \tilde{y})) - y(x)\right)^2. \tag{5}
\]

We will, for the remainder of the paper, use the prime ($'$) to
indicate estimates based on the initial training set plus
the additional example $(\tilde{x}, \tilde{y})$. To minimize Expression 5,
we need to compute how a query at $\tilde{x}$ will change the
learner’s bias at $x$. If we assume that we know the input
distribution,³ then we can integrate this change over the
entire domain (using Monte Carlo procedures) to estimate
the resulting average change, and select an $\tilde{x}$ such
that the expected squared bias is minimized. Defining
$\text{bias} \equiv \hat{y} - y$ and $\Delta\hat{y} \equiv \hat{y}' - \hat{y}$, we can write the new
squared bias as:


\begin{align*}
\text{bias}'^2 &= (\hat{y}' - y)^2 \\
&= (\hat{y} + \Delta\hat{y} - y)^2 \\
&= \Delta\hat{y}^2 + 2\Delta\hat{y} \cdot \text{bias} + \text{bias}^2. \tag{6}
\end{align*}

Note that since bias as defined here is independent of $\tilde{x}$,
minimizing $\text{bias}'^2$ is equivalent to minimizing
$\Delta\hat{y}^2 + 2\Delta\hat{y} \cdot \text{bias}$.

The estimate of $\text{bias}'^2$ tells us how much our bias will
change for a given $\tilde{x}$. We may optimize this value over $\tilde{x}$
in one of a number of ways. In low dimensional spaces,
it is often sufficient to consider a set of “candidate” $\tilde{x}$
and select the one promising the smallest resulting error.
In higher dimensional spaces, it is often more efficient to
search for an optimal $\tilde{x}$ with a response surface technique
[Box and Draper, 1987], or hillclimb on
$\partial\,\text{bias}'^2 / \partial\tilde{x}$.

Estimates of bias and $\Delta\hat{y}$ depend on the specific learning
model being used. In Section 3, I describe a locally
weighted regression model, and show how differentiable
estimates of bias and $\Delta\hat{y}$ may be computed for it.

2.1 An aside: why not just use $\hat{y} - \widehat{\text{bias}}$?

If we have an accurate bias estimate, it is reasonable to
ask why we do not simply use the corrected $\hat{y} - \widehat{\text{bias}}$
as our predictor. Certainly, in the limit of a perfect bias
estimate, the composite prediction would have zero bias,
and we could concentrate solely on variance, as previous
work has.

The answer to this question has several parts, the first
of which is that for most learners, there are no perfect
bias estimators. Bias estimators introduce their own bias
and variance, which must be addressed in data selection.

We can define a composite learner which produces
estimate $\hat{y}_c = \hat{y} - \widehat{\text{bias}}$. Given a random training sample
then, we would expect $\hat{y}_c$ to outperform $\hat{y}$. However,
there is no obvious way to select data for this composite
learner other than selecting to maximize the performance
of its two components. In our case, the second component
(the bias estimate) is non-analytic, which leaves
us selecting data so as to maximize the performance of
the first component (the uncorrected estimator). We
are now back to our original problem: we can select
data so as to minimize either the bias or variance of
the uncorrected LWR-based learner. Since the purpose
of the correction is to give an unbiased estimator,
intuition suggests that variance minimization would be the
more sensible route in this case.

Regardless of how we select our data, we can use the
composite estimator to make our predictions; depending
on how noisy the bias estimate is, this may or may not
improve the learner’s net performance. In the domains
considered in this paper, I found that the performance of
$\hat{y}_c$ using random selection or variance minimization was
not substantially different from that of the uncorrected
$\hat{y}$ (see Figure 7 in Section 4).

3 This assumption is contrary to the assumption normally
made in some forms of learning, e.g. PAC-learning, but it is
appropriate in many domains. If, for example, we are learning
to control a robot arm, it is reasonable to assume that
we know the distribution of positions over which we are
interested in controlling it.
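The candidate-based optimization described in Section 2 can be sketched in a few lines once estimates of the bias and of the change in prediction are available. This is a minimal illustration, not the paper's implementation: the function name `select_query` and the array layout are assumptions, and the two inputs are taken as given (how they may be estimated for LWR is the subject of Section 3).

```python
import numpy as np

def select_query(delta_yhat, bias):
    """Rank candidate queries by the mean new squared bias (Equation 6).

    delta_yhat: array of shape (n_candidates, n_reference); entry [c, r]
                is the predicted change in the estimate at reference
                point r if candidate c were queried.
    bias:       array of shape (n_reference,); current bias estimate at
                each reference point.
    Returns the index of the best candidate and the per-candidate scores.
    """
    delta_yhat = np.asarray(delta_yhat, dtype=float)
    bias = np.asarray(bias, dtype=float)
    # bias**2 does not depend on the candidate, so it is dropped from the
    # ranking; the score is the candidate-dependent part of Equation 6.
    score = np.mean(delta_yhat ** 2 + 2.0 * delta_yhat * bias, axis=1)
    return int(np.argmin(score)), score

# Toy usage: three candidates, two reference points.
bias = np.array([0.5, -0.2])
candidates = np.array([[0.0, 0.0],     # leaves the estimate unchanged
                       [-0.5, 0.2],    # cancels the bias exactly
                       [0.5, -0.2]])   # doubles the bias
best, score = select_query(candidates, bias)
print(best, score)
```

Averaging over reference points drawn from the input distribution plays the role of the Monte Carlo integration mentioned above; a negative score means the candidate is expected to reduce the average squared bias.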


3 Locally weighted regression 


The type of learner I consider here is a form of locally 
weighted regression (LWR) that is a slight variation on 
the LOESS model of Cleveland et al. [1988]. The LOESS 
model performs a linear regression on points in the data 
set, weighted by a kernel centered at x (see Figure 1). 
The kernel shape is a design parameter: the original 
LOESS model uses a “tricubic” kernel; in my experiments
I use the more common Gaussian


\[
h_i(x) \equiv h(x - x_i) = \exp(-k(x - x_i)^2),
\]

where $k$ is a smoothing parameter. For brevity, I will
drop the argument $x$ for $h_i(x)$, and define $n = \sum_i h_i$.
We can then write the weighted means and covariances 
as: 

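\[
\mu_x = \frac{\sum_i h_i x_i}{n}, \quad
\mu_y = \frac{\sum_i h_i y_i}{n}, \quad
\sigma_x^2 = \frac{\sum_i h_i (x_i - \mu_x)^2}{n}, \quad
\sigma_{xy} = \frac{\sum_i h_i (x_i - \mu_x)(y_i - \mu_y)}{n}.
\]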
We use these means and covariances to produce an estimate
$\hat{y}$ at the $x$ around which the kernel is centered, with
a confidence term in the form of a variance estimate.
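As an illustration, a locally linear prediction with a Gaussian kernel can be computed from kernel-weighted means and covariances as sketched below. The prediction is written in the usual weighted least-squares form (weighted mean of y plus the weighted slope times the offset from the weighted mean of x); the function names are hypothetical, and the variance estimate mentioned above is not reproduced here.

```python
import numpy as np

def lwr_stats(x, x_train, y_train, k):
    """Kernel-weighted means and covariances around the query point x."""
    h = np.exp(-k * (x - x_train) ** 2)        # Gaussian weights h_i
    n = h.sum()                                # n = sum_i h_i
    mu_x = np.sum(h * x_train) / n
    mu_y = np.sum(h * y_train) / n
    var_x = np.sum(h * (x_train - mu_x) ** 2) / n
    cov_xy = np.sum(h * (x_train - mu_x) * (y_train - mu_y)) / n
    return n, mu_x, mu_y, var_x, cov_xy

def lwr_predict(x, x_train, y_train, k):
    """Locally weighted linear regression estimate at x."""
    _, mu_x, mu_y, var_x, cov_xy = lwr_stats(x, x_train, y_train, k)
    return mu_y + cov_xy / var_x * (x - mu_x)

# Usage: a locally linear fit reproduces exactly linear data.
xs = np.linspace(0.0, 1.0, 11)
print(lwr_predict(0.3, xs, 2.0 * xs + 1.0, k=5.0))
```

Larger values of k narrow the kernel, so the fit depends on fewer neighbours of x.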
In all the experiments discussed in this paper, the
smoothing parameter $k$ was set so as to minimize $\sigma^2_{\hat{y}}$.
Figure 1: Locally weighted regression places a kernel
around the point of interest $x$. The kernel is used to
assign weightings to points in the training set, from which
$\hat{y}(x)$ is computed via linear regression.



Figure 2: Box and Draper’s method of estimating bias
measures the difference between the estimator in question
and a one-higher-order estimate. (Panels: best local
linear fit; best local quadratic fit.)


The low cost of incorporating new training examples
makes this form of locally weighted regression appealing
for learning systems which must operate in real time,
or with time-varying target functions (e.g. [Schaal and
Atkeson, 1994]).

3.1 Computing $\Delta\hat{y}$ for LWR

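If we know what new point $(\tilde{x}, \tilde{y})$ we’re going to add,
computing $\Delta\hat{y}$ for LWR is straightforward. Defining $\tilde{h}$
as the weight given to $\tilde{x}$, we can write

\begin{align*}
\Delta\hat{y} &= \hat{y}' - \hat{y}
 = \mu_y' + \frac{\sigma_{xy}'}{{\sigma_x'}^2}(x - \mu_x') - \hat{y} \\
&= \frac{n\mu_y}{n + \tilde{h}} + \frac{\tilde{h}\tilde{y}}{n + \tilde{h}}
 + \left(x - \frac{n\mu_x}{n + \tilde{h}} - \frac{\tilde{h}\tilde{x}}{n + \tilde{h}}\right)
 \frac{(n + \tilde{h})\sigma_{xy} + \tilde{h}\,(\tilde{x} - \mu_x)(\tilde{y} - \mu_y)}
      {(n + \tilde{h})\sigma_x^2 + \tilde{h}\,(\tilde{x} - \mu_x)^2} - \hat{y}.
\end{align*}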
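The incremental update can be checked numerically against a full refit. The sketch below assumes the Gaussian kernel and the kernel-weighted means and covariances (normalized by the total weight n) used in this section; the function names are hypothetical, and the new point's output is taken as known purely for the purpose of the check.

```python
import numpy as np

def weighted_stats(x, xs, ys, k):
    """Kernel-weighted statistics around the query point x."""
    h = np.exp(-k * (x - xs) ** 2)
    n = h.sum()
    mu_x, mu_y = h @ xs / n, h @ ys / n
    var_x = h @ (xs - mu_x) ** 2 / n
    cov_xy = h @ ((xs - mu_x) * (ys - mu_y)) / n
    return n, mu_x, mu_y, var_x, cov_xy

def predict(x, xs, ys, k):
    _, mu_x, mu_y, var_x, cov_xy = weighted_stats(x, xs, ys, k)
    return mu_y + cov_xy / var_x * (x - mu_x)

def delta_yhat(x, x_new, y_new, xs, ys, k):
    """Change in the estimate at x caused by adding (x_new, y_new)."""
    n, mu_x, mu_y, var_x, cov_xy = weighted_stats(x, xs, ys, k)
    h_new = np.exp(-k * (x - x_new) ** 2)      # weight given to the new point
    mu_x_new = (n * mu_x + h_new * x_new) / (n + h_new)
    mu_y_new = (n * mu_y + h_new * y_new) / (n + h_new)
    slope_new = (((n + h_new) * cov_xy
                  + h_new * (x_new - mu_x) * (y_new - mu_y))
                 / ((n + h_new) * var_x + h_new * (x_new - mu_x) ** 2))
    y_old = mu_y + cov_xy / var_x * (x - mu_x)
    return mu_y_new + slope_new * (x - mu_x_new) - y_old

# Compare the incremental formula with refitting on the enlarged set.
xs = np.linspace(0.0, 1.0, 10)
ys = np.sin(3.0 * xs)
refit = (predict(0.4, np.append(xs, 0.55), np.append(ys, 0.9), 4.0)
         - predict(0.4, xs, ys, 4.0))
print(delta_yhat(0.4, 0.55, 0.9, xs, ys, 4.0), refit)
```

The two printed values agree up to floating-point error, which is what makes evaluating many candidate queries cheap: the statistics of the existing data are computed once per reference point and only updated per candidate.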


Note that computing $\Delta\hat{y}$ requires us to know both the
$\tilde{x}$ and $\tilde{y}$ of the new point. In practice, we only know
$\tilde{x}$. If we assume, however, that we can estimate the
learner’s bias at any $\tilde{x}$, then we can also estimate the
unknown value $\tilde{y} \approx \hat{y}(\tilde{x}) - \widehat{\text{bias}}(\tilde{x})$.
Below, I consider how to compute the bias estimate.

3.2 Estimating bias for LWR

The most common technique for estimating bias is
cross-validation. Standard cross-validation, however, only
gives estimates of the bias at our specific training points,
which are usually combined to form an average bias
estimate. This is sufficient if one assumes that the training
distribution is representative of the test distribution
(which it isn’t in query learning) and if one is content
to just estimate the bias where one already has training
data (which we can’t be).

In the query selection problem, we must be able to
estimate the bias at all possible $x$. There are several ways
we can get this estimate using LWR. Box and Draper
[1987] suggest fitting a higher order model and measuring
the difference. In the case of (linear) locally weighted
regression, one would fit a locally quadratic regressor to
the data, and use the difference in estimates as the bias
(see Figure 2). Under certain conditions on the
higher-order bias terms, one can make some guarantees on the
accuracy of this bias estimate.

The disadvantages of this method stem from the fact
that it requires a higher order model. This requires
additional computation to fit, and the fit is more prone
to variance problems. For the experiments described in
this paper, this method of bias estimation yielded poor
results; two other bias-estimation techniques, however,
performed very well.

3.2.1 Estimating bias by bootstrapping residuals

Another method of estimating bias is by bootstrapping
the residuals of the training points. Based on the $m$
available training points, and the predictor’s fit to these
points, a “bootstrap sample” is created by randomly
drawing $m$ values with replacement from the learner’s
residuals. These values are added to the original
predictions to create a synthetic training set on which the
learner is retrained.

By creating a number of bootstrapped predictions and
comparing their average prediction with that of the
original predictor, one arrives at a first-order bootstrap
estimate of the predictor’s bias [Connor, 1993; Efron and
Tibshirani, 1993]. It is known that this estimate is itself
biased towards zero; a standard heuristic is to divide the
estimate by 0.632 [Efron, 1983]. A disadvantage of the
bootstrap method is that, because it requires repeated
fitting, it is computationally expensive.
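The residual bootstrap described above can be sketched as follows. This is an illustrative reading of the procedure rather than the code used in the experiments: the helper `lwr_predict` is a plain Gaussian-kernel local linear fit, the function names are hypothetical, and the default of 20 resamples simply mirrors the 20x resampling mentioned in Section 4.2.

```python
import numpy as np

def lwr_predict(x, xs, ys, k):
    """Gaussian-kernel locally weighted linear regression estimate at x."""
    h = np.exp(-k * (x - xs) ** 2)
    n = h.sum()
    mu_x, mu_y = h @ xs / n, h @ ys / n
    var_x = h @ (xs - mu_x) ** 2 / n
    cov_xy = h @ ((xs - mu_x) * (ys - mu_y)) / n
    return mu_y + cov_xy / var_x * (x - mu_x)

def bootstrap_bias(x, xs, ys, k, n_boot=20, seed=0):
    """First-order residual-bootstrap estimate of the bias at x."""
    rng = np.random.default_rng(seed)
    fitted = np.array([lwr_predict(xi, xs, ys, k) for xi in xs])
    residuals = ys - fitted
    original = lwr_predict(x, xs, ys, k)
    boot_preds = np.empty(n_boot)
    for b in range(n_boot):
        # Draw m residuals with replacement and add them to the original
        # predictions to form a synthetic training set, then refit.
        resampled = rng.choice(residuals, size=len(xs), replace=True)
        boot_preds[b] = lwr_predict(x, xs, fitted + resampled, k)
    # Average bootstrap prediction minus the original prediction,
    # rescaled by the 0.632 heuristic.
    return (boot_preds.mean() - original) / 0.632

xs = np.linspace(0.0, 1.0, 15)
print(bootstrap_bias(0.5, xs, xs ** 2, k=20.0))
```

Each call refits the learner n_boot times per query point, which is the computational cost noted above.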



3.2.2 Estimating bias by fitting cross-validated 
estimates 

One may also estimate the bias of a learner by fitting
its own cross-validated residuals. We first compute
the cross-validated residuals on the training examples.
These produce estimates of the learner’s bias at each
of the training points. We can then use these residuals
as training examples for another learner (again LWR)
to produce estimates of what the cross-validated error
would be in places where we don’t have training data
(see Figure 3).⁴
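A minimal sketch of this two-stage procedure is given below, assuming leave-one-out cross-validation, noise-free data (so the residual is the cross-validated estimate minus the observed output), and the same smoothing parameter for both stages. These choices and the function names are assumptions made for the illustration.

```python
import numpy as np

def lwr_predict(x, xs, ys, k):
    """Gaussian-kernel locally weighted linear regression estimate at x."""
    h = np.exp(-k * (x - xs) ** 2)
    n = h.sum()
    mu_x, mu_y = h @ xs / n, h @ ys / n
    var_x = h @ (xs - mu_x) ** 2 / n
    cov_xy = h @ ((xs - mu_x) * (ys - mu_y)) / n
    return mu_y + cov_xy / var_x * (x - mu_x)

def cv_bias_estimate(x, xs, ys, k):
    """Estimate the bias at x by fitting LWR to cross-validated residuals."""
    idx = np.arange(len(xs))
    # Stage 1: leave-one-out prediction at each training input.
    y_cv = np.array([lwr_predict(xs[i], xs[idx != i], ys[idx != i], k)
                     for i in idx])
    residuals = y_cv - ys          # bias estimates at the training points
    # Stage 2: a second LWR fit interpolates the residuals to any x.
    return lwr_predict(x, xs, residuals, k)

# Usage: a local linear fit overshoots a convex target in the interior,
# so the estimated bias at the centre of the domain is positive.
xs = np.linspace(0.0, 1.0, 21)
print(cv_bias_estimate(0.5, xs, xs ** 2, k=20.0))
```

Unlike the bootstrap, this needs only one extra pass of leave-one-out fits over the training set, after which the bias estimate at any input is a single LWR evaluation.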


Figure 3: Applying the learner to its own cross-validated
residuals produces an estimate of the bias that may be
evaluated over the entire domain. (Panels: “How do we
estimate the bias at input x?”; “Compute cross-validated
residuals”; “Use estimator (here, LWR) to fit residuals”.
Legend: training examples, true function, estimator.)

4 Empirical results 

In the previous two sections, I have explained how having
an estimate of $\Delta\hat{y}$ and bias for a learner allows one to
compute the learner’s change in bias given a new query,
and have shown how these estimates may be computed
for a learner that uses locally weighted regression. Here,
I apply these results to several simple problems using the
“Arm2D” domain (Figure 4) and demonstrate that they
may actually be used to select queries that minimize the
statistical bias (and the error) of the learner.


4 One subtlety that needs to be addressed is which residual
is actually fit. Denote the cross-validated estimate as $\hat{y}_{cv}$. If
we believe the data is noise-free, then the true value of the
function at $x$ is $y$, so the cross-validated bias is $\hat{y}_{cv} - y$. If,
however, there is noise, we should assume that some of that
misfit is due to noise. In this case, the proper bias estimate
should be $\hat{y}_{cv} - \hat{y}$, with the remaining difference $\hat{y} - y$ being
due to noise.
Figure 4: (left) Arm2D - The system learns arm kinematics
by specifying joint angles $(\Theta_1, \Theta_2)$ and observing
tip coordinates $(x_1, x_2)$. The goal is to minimize the
MSE of the learner’s model over the input distribution
$(\Theta_1, \Theta_2) = (U[0, 2\pi], U[0, \pi])$. (right) A sample
exploration trajectory in joint-space for the constrained arm
problem, exploring according to the cross-validation-based
bias minimizing criterion.


4.1 Bias estimates 

I tested the accuracy of the three bias estimators by
observing their correlations on 64 reference inputs, given
100 random training examples from the Arm2D domain.
When corrected with the 0.632 heuristic described above,
both the bootstrap and cross-validation methods produce
fairly accurate, albeit noisy, bias estimates (Figure 5).
The quadratic method produced poor correlation
and was dropped from the study.
Figure 5: Correlations between estimated and actual
biases for different estimators.


4.2 Bias minimization 

I ran two series of experiments using the bias-minimizing
criterion in conjunction with the bias estimation
technique of the previous section on the “Arm2D” domain.
The bias minimization criterion was used as follows: At
each time step, the learner was given a set of 64 randomly
chosen candidate queries and 64 uniformly chosen
reference points. It evaluated $E'(x)$ for each reference
point given each candidate point and selected for
its next query the candidate point with the smallest
average $E'(x)$ over the reference points. I compared the
bias-minimizing strategy (using the cross-validation and
bootstrap estimation techniques) against random sampling
and the variance-minimizing strategy discussed in
Cohn et al. [1995]. On a Sparc 10, with $m$ training
examples, the average evaluation times per candidate per
reference point were 58 + 0.16m μseconds for the variance
criterion, 65 + 0.53m μseconds for the cross-validation-based
bias criterion, and 83 + 3.7m μseconds for the
bootstrap-based bias criterion (with 20x resampling).

To test whether the bias-only assumption was robust
against the presence of noise, 1% Gaussian noise was
added to the input values of the training data in all
experiments. This simulates noisy position effectors on the
arm, and results in non-Gaussian noise in the output
coordinate system.

In the first series of experiments, the candidate points
were drawn uniformly over $(U[0, 2\pi], U[0, \pi])$. In
unconstrained domains like this, random sampling is a fairly
good default strategy. The bias minimization strategies
still significantly outperform both random sampling and
the variance minimizing strategy in these experiments
(see Figure 6).



Figure 6: MSE as a function of number of noisy training
examples for the unconstrained arm problem. The
cross-validation and bootstrap bias-minimization strategies
give a factor of 3 improvement over random selection,
and a slight improvement over variance-only minimization.
Errors are averaged over 10 runs for the bootstrap
method and 15 runs for all others. One run with
the cross-validation-based method was excluded when $k$
failed to converge to a reasonable value.

In the second series of experiments, candidates were
drawn uniformly from a region local to the previously
selected query: $(\theta_1 \pm 0.2\pi, \theta_2 \pm 0.1\pi)$. This corresponds
to restricting the arm to local motions. In a constrained
problem such as this, random sampling is a poor strategy;
both the bias and variance-reducing strategies outperform
it by at least an order of magnitude. Further, the
bias-minimization strategy outperforms variance minimization
by a large margin (Figure 7). Figure 4 shows
an exploration trajectory produced by pursuing the
bias-minimizing criterion. It is noteworthy that, although
the implementation in this case was a greedy (one-step)
minimization, the trajectory results in globally good
exploration.

5 Discussion 

I have argued in this paper that, in many situations,
selecting queries to minimize learner bias is an appropriate
and effective strategy for active learning. I have given
empirical evidence that, with a LWR-based learner and
the examples considered here, the strategy is effective
even in the presence of noise.

Figure 7: MSE as a function of number of noisy training
examples for the constrained arm problem.
Bias-minimization significantly outperforms the
variance-minimizing algorithm and random exploration.

Beyond minimizing either bias or variance, an important
next step is to explicitly minimize them together.
The bootstrap-based estimate should facilitate this, as it
produces a complementary variance estimate with little
additional computation.⁵ By optimizing over both criteria
simultaneously, we expect to derive a criterion that,
in terms of statistics, is truly optimal for selecting
queries.